Stochastic Consensus Clustering
Authors
Abstract
The clustering method described in this paper is meant to aid the researcher who, for any of a variety of reasons, has clustered a data set a large number of times and is now faced with the problem of reconciling the many different clustering results into a single, robust clustering solution. In this paper a clustering method is developed that takes the results of these many data clusterings, properly arranges them to form a similarity matrix, and then uses the variable-aggregation ideas of Herbert Simon and Albert Ando to provide a single solution. Once the germane parts of Simon-Ando theory are reviewed, an algorithm is developed and tested.

I. PROBLEM DESCRIPTION

For a variety of reasons, we may cluster a data set multiple times. We may use a clustering method (e.g., k-means or non-negative matrix factorization (NMF)) that does not return a unique solution and wish to see a variety of such solutions. We may be unsure which of the many clustering algorithms is best suited to the application, and so decide to cluster the data with several of them. We may be unsure of k, the number of clusters to be found, and so decide to repeatedly run an algorithm over a range of reasonable values. Lastly, we may be overwhelmed by the large number of parameter values that most commercially available clustering software allows and decide to repeatedly run the algorithm, each time with a different parameter set. Note that these reasons are not mutually exclusive; we may cluster a data set many times for any combination of them. The clustering problem then becomes one of reconciling these multiple results to obtain a final answer. One approach to this problem is to determine a clustering that is as close as possible to all the clusterings already found. This is an optimization problem known as the median partition problem, which is NP-complete. A number of heuristics for the median partition problem exist, and a comparison of their performance can be found in [1].
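The median partition objective mentioned above can be made concrete with a small sketch. Here "closeness" between two partitions is measured by the number of element pairs on which they disagree (a Mirkin-style pairwise distance); the function names are ours, chosen for illustration, and this is not the heuristic of [1]:

```python
import numpy as np

def coassignment(labels):
    """A[i, j] = 1 iff elements i and j share a cluster in this partition."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def partition_distance(p, q):
    """Number of element pairs on which the two partitions disagree."""
    return int(np.abs(coassignment(p) - coassignment(q)).sum() // 2)

def median_cost(candidate, clusterings):
    """Total distance from a candidate partition to all given clusterings.
    The median partition is the candidate minimizing this cost."""
    return sum(partition_distance(candidate, c) for c in clusterings)
```

For example, the partitions [0, 0, 1, 1] and [0, 1, 1, 1] disagree on exactly the three pairs involving element 1, so their distance is 3; the median partition problem asks for the labeling that minimizes the summed distance to every clustering in the ensemble.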
Another approach that has been studied is a consensus clustering framework based on a variational Bayes mixture of Gaussians [2]. A third approach involves storing all the clustering results in matrix form. For example, the result from each run of a clustering algorithm is stored in an adjacency matrix A, where A_ij = 1 if data set elements i and j are in the same cluster and A_ij = 0 otherwise. The collection of these adjacency matrices can be used to represent the connections between the original data as a hypergraph, and clusters can be discovered through hypergraph partitioning algorithms [3]. Alternatively, we can create a matrix S, the sum of all these adjacency matrices, and then aim to cluster the original data based on the contents of S (often each element of S is divided by the number of clustering runs to create a matrix with entries in the interval [0, 1]). Methods that have been used for this problem include clustering S using single-link hierarchical clustering after zeroing out entries below a certain threshold [4]. A new approach introduced in this paper is based on the observation that the structure of S, as constructed above, is that of a nearly completely decomposable matrix, and thus the results of Simon-Ando theory are applicable. Simon-Ando theory assumes that we have knowledge of the structure of a system (i.e., the clusters), which allows us to make predictions about the behavior of the system over time. The key insight for creating this new clustering method is that we can consider the problem from the opposite direction; that is, we can look at the behavior of the system over time and use it to determine the clusters.

II. OVERVIEW OF SIMON-ANDO THEORY

Simon and Ando [5] developed the theory for the behavior of the vector x_t defined by
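The construction of S described above is straightforward to sketch: each run's label vector yields an adjacency matrix, and averaging them over all runs gives entries in [0, 1]. Function names here are illustrative, not taken from the paper:

```python
import numpy as np

def adjacency(labels):
    """A[i, j] = 1 if elements i and j fall in the same cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus_matrix(clusterings):
    """Sum the adjacency matrices of all runs and divide by the number
    of runs, so each entry is the fraction of runs co-clustering a pair."""
    return sum(adjacency(c) for c in clusterings) / len(clusterings)
```

With three runs labeled [0, 0, 1, 1], [0, 0, 1, 1], and [0, 1, 1, 1], entry S[0, 1] is 2/3 (elements 0 and 1 were co-clustered in two of three runs), while S[2, 3] is 1. When the runs largely agree, off-diagonal blocks of S stay near zero and diagonal blocks near one, which is exactly the nearly completely decomposable structure the paper exploits.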
Similar articles
Entropy-based Consensus for Distributed Data Clustering
The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...
Asymptotic Behavior of Mean Partitions in Consensus Clustering
Although consistency is a minimum requirement of any estimator, little is known about the consistency of the mean partition approach in consensus clustering. This contribution studies the asymptotic behavior of mean partitions. We show that under normal assumptions, the mean partition approach is consistent and asymptotically normal. To derive both results, we represent partitions as points of some geo...
Consensus clustering in complex networks
The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate sta...
Determining the Number of Clusters via Iterative Consensus Clustering
We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high...
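The idea that abstract describes, using the spectrum of the random-walk transition matrix to choose k, can be sketched with a simple eigengap heuristic. This is our illustration of the general technique, not that paper's exact procedure, and the function name is an assumption:

```python
import numpy as np

def estimate_k(S):
    """Row-normalize the consensus matrix S into a random-walk transition
    matrix P, then place k at the largest gap in P's eigenvalue moduli."""
    P = S / S.sum(axis=1, keepdims=True)                 # transition matrix
    vals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]   # moduli, descending
    gaps = vals[:-1] - vals[1:]                          # successive gaps
    return int(np.argmax(gaps)) + 1                      # k eigenvalues near 1
```

On a consensus matrix with two strong diagonal blocks and weak off-diagonal entries, two eigenvalues sit near 1 and the rest near 0, so the largest gap falls after the second eigenvalue and the estimate is k = 2.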
Bagging for Path-Based Clustering
A resampling scheme for clustering with similarity to bootstrap aggregation (bagging) is presented. Bagging is used to improve the quality of path-based clustering, a data clustering method that can extract elongated structures from data in a noise-robust way. The results of an agglomerative optimization method are influenced by small fluctuations of the input data. To increase the reliability o...
Improving cluster analysis by co-initializations
Many modern clustering methods employ a non-convex objective function and use iterative optimization algorithms to find local minima. Thus initialization of the algorithms is very important. Conventionally the starting guess of the iterations is randomly chosen; however, such a simple initialization often leads to poor clusterings. Here we propose a new method to improve cluster analysis by com...